This project aims to find which chemical properties influence the quality of red wines. The data set was created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV), 2009.
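# Load the Data
# Packages used below: ggplot2 (qplot, ggplot), dplyr (mutate, select, group_by, summarize), tidyr (gather)
library(ggplot2); library(dplyr); library(tidyr)
wineQualityReds<-read.csv("wineQualityReds.csv",na.strings="",row.names=1)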
dim(wineQualityReds)
## [1] 1599 12
str(wineQualityReds)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
names(wineQualityReds)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Let’s check missing values.
apply(wineQualityReds,2,function(x){sum(is.na(x))})
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
No missing values here.
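The per-column count above can also be written with `colSums(is.na(...))`, which avoids the anonymous function. The sketch below is only an illustration on a toy data frame (the `toy` object and its values are made up, not taken from the wine data) and shows that both forms agree.

```r
# Toy data frame for illustration only (not the wine data)
toy <- data.frame(alcohol = c(9.4, NA, 9.8), pH = c(3.51, 3.20, 3.26))

# Form used in this report: one sum per column via apply()
na_apply <- apply(toy, 2, function(x) sum(is.na(x)))

# Equivalent vectorised form
na_colsums <- colSums(is.na(toy))

# Quick yes/no check for the whole data frame
has_missing <- anyNA(toy)
```

On the real data, `colSums(is.na(wineQualityReds))` should give the same zeros as the `apply()` call.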
quality<-wineQualityReds$quality
summary(quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
pie(table(quality),main="Pie of Red Wine Qualities")
out<-boxplot.stats(quality)$out
out
## [1] 8 8 8 8 8 3 8 8 8 3 8 3 8 3 3 8 8 8 8 8 3 3 8 8 3 3 3 8
Most red wines have a quality between 5 and 7, and the average quality is 5.64. The best wines in this dataset are rated 8 and the worst are rated 3; both are flagged as outliers, but we keep them for further analysis.
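To see why ratings 3 and 8 are flagged, it may help to reproduce the rule behind `boxplot.stats()` by hand: points beyond 1.5 times the hinge spread from the lower or upper hinge count as outliers. With hinges of 5 and 6 (matching the quartiles reported by `summary(quality)` above), the fences fall at 3.5 and 7.5. The sketch below uses a made-up vector `q`, not the real scores.

```r
# Toy quality scores for illustration only (not the real data)
q <- rep(3:8, times = c(2, 5, 40, 38, 12, 3))

# boxplot.stats() uses Tukey's hinges (as returned by fivenum()) and coef = 1.5
hinges <- fivenum(q)[c(2, 4)]
spread <- diff(hinges)
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread

# Values outside the fences, computed by hand and by the built-in function
manual_out  <- q[q < lower_fence | q > upper_fence]
builtin_out <- boxplot.stats(q)$out
```

Because quality is an integer score with a hinge spread of 1, every wine rated 3 or 8 lands outside the fences, so these "outliers" are simply the rarest ratings rather than measurement errors.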
fixed.acidity<-wineQualityReds$fixed.acidity
summary(fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
qplot(fixed.acidity,binwidth=0.2)
Most values are concentrated between 6 and 11. The data is right skewed. Let’s look at the density.
qplot((fixed.acidity),data=wineQualityReds,geom="density",color=factor(quality))
It seems the distributions for quality 7 and 8 differ from the others; perhaps that is the reason for the skew.
qplot(factor(quality),fixed.acidity,data=wineQualityReds,geom="boxplot")
The medians do not show a clear trend.
volatile.acidity<-wineQualityReds$volatile.acidity
summary(volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
qplot(volatile.acidity,bins=30)
qplot(volatile.acidity,geom="density")
qplot(factor(quality),volatile.acidity,data=wineQualityReds,geom="boxplot")
In terms of medians, volatile.acidity seems to have a negative impact on quality. Most values are concentrated between 0.25 and 0.8.
citric.acid<-wineQualityReds$citric.acid
summary(wineQualityReds$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
qplot(citric.acid,bins=30)
qplot(citric.acid,geom='density')
The data is right skewed, and the density has many peaks.
qplot(citric.acid,color=factor(quality),geom='density')
The distribution varies with quality.
qplot(factor(quality),citric.acid,data=wineQualityReds,geom="boxplot")
sum(wineQualityReds$citric.acid==0)
## [1] 132
It seems citric acid has a positive influence on the quality of red wines. Note: 132 citric.acid values are 0, perhaps because the amounts are too small to detect.
residual.sugar<-wineQualityReds$residual.sugar
summary(residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
qplot(residual.sugar,bins=30)
The data has a long tail, perhaps due to outliers. Let’s count the outliers and restrict the plot to values between 1 and 5.
out<-boxplot.stats(wineQualityReds$residual.sugar)$out#Outliers
length(out)#number of outliers
## [1] 155
qplot(residual.sugar,binwidth=0.1) + scale_x_continuous(limits=c(1,5))
Now it seems more symmetric.
qplot(x=factor(quality),y=residual.sugar,geom="boxplot",ylim=c(1,4))
There are many outliers within each quality level, and there seems to be no clear trend between residual.sugar and quality.
chlorides<-wineQualityReds$chlorides
summary(chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
qplot(chlorides,bins=30)
The data has a long tail, obviously due to outliers.
qplot(chlorides[chlorides<0.15],bins=30)
After removing the outliers, the distribution tends to be symmetric.
qplot(factor(quality),chlorides,data=wineQualityReds,geom="boxplot")
qplot(factor(quality),chlorides,data=wineQualityReds,geom="boxplot",
ylim=c(0,0.2))
## Warning: Removed 41 rows containing non-finite values (stat_boxplot).
There are also many chlorides outliers within quality 5 and 6. There’s no clear trend.
free.sulfur.dioxide<-wineQualityReds$free.sulfur.dioxide
summary(free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
qplot(free.sulfur.dioxide,bins=30)
The data is a little right skewed.
out<-boxplot.stats(wineQualityReds$free.sulfur.dioxide)$out#Outliers
length(out)
## [1] 30
qplot(log(free.sulfur.dioxide),bins=25)
qplot(factor(quality),free.sulfur.dioxide,data=wineQualityReds,geom="boxplot")
The log transformation does not improve the shape, and there’s no clear trend between quality and free sulfur dioxide.
qplot(free.sulfur.dioxide,color=factor(quality),geom="density")
total.sulfur.dioxide<-wineQualityReds$total.sulfur.dioxide
summary(total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
qplot(total.sulfur.dioxide,bins=30)
The data is also right skewed.
out<-boxplot.stats(total.sulfur.dioxide)$out
length(out)
## [1] 55
qplot(log(total.sulfur.dioxide),bins=30)
After log transformation, the data becomes roughly symmetric.
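The "more symmetric after log" judgement is made by eye from the histograms. One way to put a number on it is the sample skewness, which is near 0 for a symmetric distribution and positive for a right-skewed one. The sketch below is an assumption-laden illustration: `skewness()` is a hypothetical helper written here (it is not a base R function), and `x` is synthetic log-normal data standing in for a right-skewed variable such as total.sulfur.dioxide.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Synthetic right-skewed data for illustration only (not the wine data)
set.seed(123)
x <- rlnorm(1000, meanlog = 3.6, sdlog = 0.7)

skew_raw <- skewness(x)       # clearly positive for right-skewed data
skew_log <- skewness(log(x))  # close to 0 after the log transformation
```

On the real data, comparing `skewness(total.sulfur.dioxide)` with `skewness(log(total.sulfur.dioxide))` would give the same kind of before/after comparison.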
qplot(factor(quality),total.sulfur.dioxide,data=wineQualityReds,geom="boxplot")
No clear trend. In order to analyze the bound form of sulfur dioxide, we create a new variable, bf.sulfur.dioxide.
wineQualityReds<-mutate(wineQualityReds, bf.sulfur.dioxide=total.sulfur.dioxide-free.sulfur.dioxide)
bf.sulfur.dioxide<-wineQualityReds$bf.sulfur.dioxide
qplot(bf.sulfur.dioxide,bins=30)
qplot(log(bf.sulfur.dioxide),bins=30)
The raw values are still right skewed. After log transformation, the data becomes roughly symmetric.
density<-wineQualityReds$density
summary(density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
qplot(density,bins=30)
var(density)
## [1] 3.562029e-06
The data looks symmetric. The values are tightly centered around 0.9967, and the variance is small (a standard deviation of about 0.0019).
qplot(factor(quality),density,data=wineQualityReds,geom="boxplot")
Density has a negative impact on quality.
pH<-wineQualityReds$pH
summary(pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
qplot(pH,bins=30)
The data looks symmetric.
qplot(factor(quality),pH,data=wineQualityReds,geom="boxplot")
It seems pH has a negative impact on quality.
sulphates<-wineQualityReds$sulphates
summary(sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
qplot(sulphates,bins=30)
The data looks a little right skewed. Let’s cut the tail.
qplot(sulphates[sulphates<1],bins=30)
Still right skewed.
qplot(factor(quality),sulphates,data=wineQualityReds,geom="boxplot")
It seems sulphates have a positive impact on quality.
alcohol<-wineQualityReds$alcohol
summary(alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
qplot(alcohol,bins=30)
The data is right skewed.
qplot(log(alcohol),bins=30)
Log transformation does not work well here.
qplot(alcohol,data=wineQualityReds,geom="density",facets=.~quality)
It seems alcohol is distributed differently across quality levels. Perhaps that’s the reason why it is skewed.
qplot(factor(quality),alcohol,data=wineQualityReds,geom="boxplot")
In terms of medians, alcohol seems to have a positive impact on quality, consistent with the density plots above.
The dataset consists of 1599 rows and 12 columns: 1599 red wines, 11 numeric features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”) and 1 integer outcome (quality). Quality ranges from 3 (worst) to 8 (best) in this dataset. The mean quality is 5.636 and the median is 6; most wines have a quality between 5 and 7. Furthermore, the histograms show that fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol are right skewed, and residual.sugar and chlorides both have outliers that influence their distributions to some extent.
According to the boxplots above, volatile.acidity, citric.acid, density, pH, sulphates and alcohol are the main features, because they seem to have either a positive or a negative relationship with quality. Specifically, citric.acid, sulphates and alcohol seem to have a positive impact on quality, whereas the other main features have a negative impact.
I think fixed.acidity, chlorides, residual.sugar, total.sulfur.dioxide and bf.sulfur.dioxide can support the investigation of the main features, since they have no missing values.
I created a bf.sulfur.dioxide feature as the difference between total sulfur dioxide and free sulfur dioxide. The two forms of sulfur dioxide may contribute differently to the quality of red wine. The log transformations of total.sulfur.dioxide and bf.sulfur.dioxide have better-shaped distributions.
The histograms show that fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol are right skewed, and the density plots show that their distributions differ across quality levels. I used log transformations and outlier removal to make the distributions more balanced. Removing outliers from sulphates and chlorides gives better-looking histograms, and the log transformation works well for total.sulfur.dioxide and bf.sulfur.dioxide.
wineQualityReds2<-mutate(wineQualityReds,log.total.sulfur.dioxide=log(total.sulfur.dioxide),log.bf.sulfur.dioxide=log(bf.sulfur.dioxide))
wineQualityReds2<-select(wineQualityReds2,-total.sulfur.dioxide,-bf.sulfur.dioxide,-free.sulfur.dioxide)
We applied log transformations to total.sulfur.dioxide and bf.sulfur.dioxide, dropped the untransformed sulfur dioxide columns, and stored the result in wineQualityReds2.
Group the dataset by quality and calculate the median and variance of each main feature.
wineQualityReds.group<-group_by(wineQualityReds2,quality)
s<-summarize(wineQualityReds.group,citric_md=median(citric.acid),
sul_md=median(sulphates),
alc_md=median(alcohol),
vol_md=median(volatile.acidity),
pH_md=median(pH),
citric_var=var(citric.acid),
sul_var=var(sulphates),
alc_var=var(alcohol),
vol_var=var(volatile.acidity),
pH_var=var(pH)
)
s
## Source: local data frame [6 x 11]
##
## quality citric_md sul_md alc_md vol_md pH_md citric_var sul_var
## (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 3 0.035 0.545 9.925 0.845 3.39 0.06283222 0.01488889
## 2 4 0.090 0.560 10.000 0.670 3.37 0.04041321 0.05730806
## 3 5 0.230 0.580 9.700 0.580 3.30 0.03240095 0.02926229
## 4 6 0.260 0.640 10.500 0.490 3.32 0.03806730 0.02516967
## 5 7 0.400 0.740 11.500 0.370 3.28 0.03780388 0.01839791
## 6 8 0.420 0.740 12.150 0.370 3.23 0.03981046 0.01331242
## Variables not shown: alc_var (dbl), vol_var (dbl), pH_var (dbl)
Just as the boxplots above indicate, good red wines tend to contain more citric.acid, sulphates and alcohol, and to have lower volatile.acidity and pH.
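The direction of each trend can be summarised with a rank correlation between quality and the group medians. The sketch below copies the medians from the table printed above into a small data frame (`s_md` is a name introduced here for illustration) and computes Spearman's rho for each column. It describes only the six group medians, not the individual wines, so it should be read as a rough summary rather than a test.

```r
# Group medians copied from the summary table printed above
s_md <- data.frame(
  quality   = 3:8,
  citric_md = c(0.035, 0.090, 0.230, 0.260, 0.400, 0.420),
  sul_md    = c(0.545, 0.560, 0.580, 0.640, 0.740, 0.740),
  alc_md    = c(9.925, 10.000, 9.700, 10.500, 11.500, 12.150),
  vol_md    = c(0.845, 0.670, 0.580, 0.490, 0.370, 0.370),
  pH_md     = c(3.39, 3.37, 3.30, 3.32, 3.28, 3.23)
)

# Spearman rank correlation of each median column with quality
rho <- sapply(s_md[-1], function(m) cor(s_md$quality, m, method = "spearman"))
round(rho, 2)
```

The signs agree with the text: positive for citric.acid, sulphates and alcohol, negative for volatile.acidity and pH. The alcohol value is lower than the others because the median dips at quality 5 (9.700) before rising again.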
s1<-gather(s,feature,median,citric_md:pH_md)
qplot(quality,median,data=s1,facets=.~feature)+geom_line()
s2<-gather(s,feature,var,citric_var:pH_var)
qplot(quality,var,data=s2,facets=.~feature)+geom_line()
Let’s have a look at the relationships between the main features of interest.
pairs(volatile.acidity~citric.acid+pH+density+sulphates+alcohol,data=wineQualityReds2)
cor(select(wineQualityReds2,volatile.acidity,citric.acid,pH,density,sulphates,alcohol))
## volatile.acidity citric.acid pH density
## volatile.acidity 1.00000000 -0.5524957 0.2349373 0.02202623
## citric.acid -0.55249568 1.0000000 -0.5419041 0.36494718
## pH 0.23493729 -0.5419041 1.0000000 -0.34169933
## density 0.02202623 0.3649472 -0.3416993 1.00000000
## sulphates -0.26098669 0.3127700 -0.1966476 0.14850641
## alcohol -0.20228803 0.1099032 0.2056325 -0.49617977
## sulphates alcohol
## volatile.acidity -0.26098669 -0.20228803
## citric.acid 0.31277004 0.10990325
## pH -0.19664760 0.20563251
## density 0.14850641 -0.49617977
## sulphates 1.00000000 0.09359475
## alcohol 0.09359475 1.00000000
There seems to be a moderate negative relationship between volatile.acidity and citric.acid, and between citric.acid and pH.
g<-ggplot(wineQualityReds2,aes(volatile.acidity,citric.acid))
g<-g+geom_point(alpha=1/10)+geom_smooth()
g
Let’s check the linear relationship.
g+geom_smooth(method="lm")
cor.test(volatile.acidity,citric.acid,method="pearson")
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
According to the correlation value (r = -0.55), volatile.acidity is negatively correlated with citric.acid.
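The numbers in the `cor.test` output can be reproduced from just the correlation and the sample size, which makes it clearer what the test reports. The sketch below uses the standard formulas for a Pearson correlation: the t statistic r * sqrt(n - 2) / sqrt(1 - r^2) and a confidence interval from Fisher's z transformation. The values of `r` and `n` are copied from the output above; the variable names are introduced here for illustration.

```r
r <- -0.5524957   # sample correlation reported above
n <- 1599         # number of wines

# t statistic with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
```

With 1599 observations even a modest correlation gives a tiny p-value, so the size of r (about -0.55 here) is more informative than the p-value.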
g<-ggplot(wineQualityReds2,aes(citric.acid,pH))
g+geom_jitter()+geom_smooth(method="lm")
cor.test(citric.acid,pH,method="pearson")
##
## Pearson's product-moment correlation
##
## data: citric.acid and pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5756337 -0.5063336
## sample estimates:
## cor
## -0.5419041
pH is negatively correlated with citric.acid (r = -0.54).
pairs(fixed.acidity~chlorides+residual.sugar+log.total.sulfur.dioxide+log.bf.sulfur.dioxide,data=wineQualityReds2)
cor(select(wineQualityReds2,fixed.acidity,chlorides,residual.sugar,log.total.sulfur.dioxide,log.bf.sulfur.dioxide))
## fixed.acidity chlorides residual.sugar
## fixed.acidity 1.00000000 0.09370519 0.11477672
## chlorides 0.09370519 1.00000000 0.05560954
## residual.sugar 0.11477672 0.05560954 1.00000000
## log.total.sulfur.dioxide -0.11789982 0.06022193 0.14747141
## log.bf.sulfur.dioxide -0.04820219 0.08794616 0.15160657
## log.total.sulfur.dioxide log.bf.sulfur.dioxide
## fixed.acidity -0.11789982 -0.04820219
## chlorides 0.06022193 0.08794616
## residual.sugar 0.14747141 0.15160657
## log.total.sulfur.dioxide 1.00000000 0.94150455
## log.bf.sulfur.dioxide 0.94150455 1.00000000
It seems that log.total.sulfur.dioxide and log.bf.sulfur.dioxide have a very strong linear relationship (r = 0.94). Therefore, in the later discussion, we only talk about log.total.sulfur.dioxide.
g<-ggplot(wineQualityReds2,aes(log.total.sulfur.dioxide,log.bf.sulfur.dioxide))
g+geom_point()+geom_smooth(method="lm")
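Reading near-duplicate features off a printed correlation matrix gets tedious as the number of columns grows. A minimal sketch of doing it programmatically follows, assuming a threshold of 0.9 on the absolute correlation (an arbitrary choice for illustration). The data are synthetic stand-ins: `log_total`, `log_bound` and `noise` are made-up columns, not the wine measurements.

```r
# Synthetic data for illustration only (not the wine data)
set.seed(42)
n     <- 500
total <- rlnorm(n, meanlog = 3.6, sdlog = 0.7)    # stand-in for a "total" amount
free  <- total * runif(n, min = 0.1, max = 0.5)   # stand-in for the "free" part
toy <- data.frame(
  log_total = log(total),
  log_bound = log(total - free),
  noise     = rnorm(n)                            # unrelated column
)

cm <- cor(toy)

# Keep each pair once (upper triangle) where |r| exceeds the threshold
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
pairs_found <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
pairs_found
```

Applied to the correlation matrix above, the same filter would single out the log.total.sulfur.dioxide and log.bf.sulfur.dioxide pair (r = 0.94).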
No clear relationship between volatile.acidity and other features.
qplot(fixed.acidity,citric.acid,geom=c("smooth","point"))
qplot(chlorides,citric.acid,data=wineQualityReds2,geom=c("smooth","point"))
qplot(residual.sugar,citric.acid,geom=c("smooth","point"))
qplot(log.total.sulfur.dioxide,citric.acid,data=wineQualityReds2,geom=c("smooth","point"))
It seems fixed.acidity has a positive relationship with citric.acid.
cor.test(citric.acid,fixed.acidity,method="pearson")
##
## Pearson's product-moment correlation
##
## data: citric.acid and fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
The correlation value (r = 0.67) supports this observation.
qplot(fixed.acidity,pH,geom=c("smooth","point"))
qplot(chlorides,pH,geom=c("smooth","point"))
qplot(residual.sugar,pH,geom=c("smooth","point"))
qplot(log.total.sulfur.dioxide,pH,data=wineQualityReds2,geom=c("smooth","point"))
It seems fixed.acidity has a negative relationship with pH.
cor.test(pH,fixed.acidity,method="pearson")
##
## Pearson's product-moment correlation
##
## data: pH and fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
The magnitude of the correlation is over 0.6 (r = -0.68).
The other features show no obvious relationship with pH.
qplot(fixed.acidity,density,geom=c("smooth","jitter"),alpha=I(1/10)) # I() sets a constant alpha instead of mapping it to a legend
qplot(chlorides,density,geom=c("smooth","point"))
qplot(residual.sugar,density,geom=c("smooth","point"))
qplot(log.total.sulfur.dioxide,density,data=wineQualityReds2,geom=c("smooth","point"))
It seems fixed.acidity has a positive relationship with density.
cor.test(fixed.acidity,density,method="pearson")
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
The correlation value is over 0.6 (r = 0.67).
The other features show no clear relationship with density.
Just as the boxplots above indicate, good red wines (high quality) tend to contain more citric.acid, sulphates and alcohol, and to have lower volatile.acidity and pH, in terms of medians. Alcohol has a relatively large variance within each quality level, partly because the features have not been normalized. From the scatterplots above, it seems volatile.acidity decreases as citric.acid increases, and pH is negatively correlated with citric.acid. Besides, citric.acid and density grow as fixed.acidity increases, whereas pH falls as fixed.acidity increases; the absolute correlations are above 0.6, which shows a fairly strong linear relationship.
log(bf.sulfur.dioxide) has a very strong positive linear relationship with log(total.sulfur.dioxide), so we can drop one of them from further discussion.
According to the correlations and plots, fixed.acidity has a strong relationship with citric.acid, density and pH; the absolute values of their correlation coefficients are above 0.6. fixed.acidity can be removed because it is dependent on other features, and bf.sulfur.dioxide can be removed as well.
g<-ggplot(wineQualityReds2,aes(volatile.acidity,citric.acid))
g+geom_jitter()+geom_smooth(method="lm")+facet_grid(.~quality)
g<-ggplot(wineQualityReds2,aes(pH,citric.acid))
g+geom_jitter()+geom_smooth(method="lm")+facet_grid(.~quality)
It seems quality does not change the trend between volatile.acidity and citric.acid, or between pH and citric.acid. Because citric.acid is correlated with volatile.acidity and pH, we do not consider citric.acid as an independent feature.
As there are only a few high-quality or low-quality red wine samples, we can cut the qualities into three categories: “bad” (3-4), “good” (5-6) and “excellent” (7-8).
wineQualityReds2<-mutate(wineQualityReds2,quality_join = cut(quality,
breaks=c(2,4,6,8),labels=c("bad","good","excellent")),
alcohol_join=cut(alcohol,breaks=c(8,9.5,10.2,11.1,15),
labels=c("lower","low","high","higher")))
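Since `cut()` uses intervals that are open on the left and closed on the right by default, it is worth checking which scores land in which category. The small sketch below applies the same quality breaks to the six possible scores; `q` and `q_cat` are names introduced here for illustration. The alcohol breaks used above (9.5, 10.2, 11.1) are the quartiles reported earlier by `summary(alcohol)`, so that variable is split into roughly equal-sized groups.

```r
# The six quality scores that occur in the dataset
q <- 3:8

# Same breaks and labels as in the mutate() call: (2,4], (4,6], (6,8]
q_cat <- cut(q, breaks = c(2, 4, 6, 8), labels = c("bad", "good", "excellent"))
data.frame(quality = q, quality_join = q_cat)

# A score outside (2, 8] becomes NA instead of raising an error
outside <- cut(9, breaks = c(2, 4, 6, 8), labels = c("bad", "good", "excellent"))
```

Scores 3 and 4 map to "bad", 5 and 6 to "good", and 7 and 8 to "excellent".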
qplot(volatile.acidity,data=wineQualityReds2,geom="density",color=quality_join)
qplot(sulphates,data=wineQualityReds2,geom="density",color=quality_join)
qplot(pH,data=wineQualityReds2,geom="density",color=quality_join)
pH distribution doesn’t vary much with quality.
qplot(volatile.acidity,sulphates,data=wineQualityReds2,
  geom="jitter",alpha=I(1/10), color=quality_join) # I() keeps alpha a constant
qplot(pH,sulphates,data=wineQualityReds2,
  geom="jitter",alpha=I(1/10), color=quality_join)
It seems higher-quality wines cluster at lower volatile.acidity and higher sulphates or higher pH.
qplot(volatile.acidity,sulphates,data=wineQualityReds2,
  geom="jitter",alpha=I(1/10), color=quality_join,facets=alcohol_join~.)
More alcohol, more sulphates and less volatile.acidity tend to make good wine.
qplot(volatile.acidity,sulphates*pH,data=wineQualityReds2,
  geom="jitter",alpha=I(1/10), color=quality_join,facets=alcohol_join~.)
qplot(volatile.acidity,sulphates*chlorides,data=wineQualityReds2,
  geom="jitter",alpha=I(1/10), color=quality_join)
qplot(pH/sulphates,volatile.acidity,data=wineQualityReds2,
  geom="jitter",alpha=I(1/10), color=quality_join)
These interactions do not improve the separation between the quality classes.
library(tree)
## Warning: package 'tree' was built under R version 3.2.5
tree.fit<-tree(quality_join~volatile.acidity+sulphates+alcohol+pH,
data=wineQualityReds2)
plot(tree.fit)
text(tree.fit,cex=0.7)
summary(tree.fit)
##
## Classification tree:
## tree(formula = quality_join ~ volatile.acidity + sulphates +
## alcohol + pH, data = wineQualityReds2)
## Number of terminal nodes: 10
## Residual mean deviance: 0.8083 = 1284 / 1589
## Misclassification error rate: 0.1538 = 246 / 1599
The training accuracy is about 85% (1 - 0.1538), and the features used are consistent with our previous analysis.
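The accuracy figure follows directly from the misclassification counts in the summary. Because most wines fall into the middle category, it is also useful to compare it with the accuracy of always predicting the most frequent class. The sketch below does the arithmetic with the counts printed above and defines `majority_baseline()`, a hypothetical helper introduced here; the `toy_labels` vector is made up and does not reflect the real class counts.

```r
# Numbers copied from the summary(tree.fit) output above
n_total <- 1599
n_wrong <- 246
error_rate <- n_wrong / n_total
accuracy   <- 1 - error_rate

# Hypothetical helper: accuracy of always predicting the most frequent class
majority_baseline <- function(labels) {
  max(table(labels)) / length(labels)
}

# Toy labels for illustration only (not the real class counts)
toy_labels   <- c(rep("bad", 1), rep("good", 8), rep("excellent", 1))
baseline_toy <- majority_baseline(toy_labels)
```

In this report the comparison would be `majority_baseline(wineQualityReds2$quality_join)` against the tree's training accuracy.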
tree.fit<-tree(quality_join~.-fixed.acidity-quality-citric.acid,
data=wineQualityReds2)
plot(tree.fit)
text(tree.fit,cex=0.7)
summary(tree.fit)
##
## Classification tree:
## tree(formula = quality_join ~ . - fixed.acidity - quality - citric.acid,
## data = wineQualityReds2)
## Variables actually used in tree construction:
## [1] "alcohol" "sulphates"
## [3] "log.total.sulfur.dioxide" "volatile.acidity"
## [5] "pH"
## Number of terminal nodes: 11
## Residual mean deviance: 0.7954 = 1263 / 1588
## Misclassification error rate: 0.1538 = 246 / 1599
The misclassification error rate is unchanged (0.1538), so the other features do not contribute much to the classification.
We transformed the red wine quality problem into a classification problem. We cut the qualities into three categories, selected several strong main features based on the previous analysis, and explored how they jointly decide the quality of red wine, including through interactions. We also built a tree model on the main features and on the other features, which reinforced our earlier analysis.
From the figures and tree model discussed above, volatile.acidity, sulphates and alcohol strengthened each other in deciding the quality of red wine; they are also the main features identified in the earlier univariate and bivariate analysis, whereas pH did not seem to help much in classification. We also explored the influence of other features, such as chlorides and log.total.sulfur.dioxide, but they do not play important roles in the model.
I haven’t found any interesting interactions between features.
I created a classification tree model to explore the relationship between quality and the other features. Its strengths are that it is easy to understand and explain from the tree plot, since quality is decided by splits on the other features, and that it can be applied to new data very quickly. Its limitation is that a tree model may be too flexible, which can cause overfitting to the training dataset; accuracy might improve with a more suitable model such as a random forest.
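The error rates reported above are computed on the same rows the tree was fitted to, so they cannot reveal overfitting. One common way to check is to hold out part of the data and measure the error there. The sketch below only shows the split itself; `holdout_split()` is a hypothetical helper, and the 30% test fraction and the seed are arbitrary assumptions. The model-fitting lines are left as comments because they need the data file and the tree package.

```r
# Hypothetical helper (not part of the original analysis): split row indices
# into a training part and a held-out test part.
holdout_split <- function(n, test_frac = 0.3, seed = 1) {
  set.seed(seed)
  test_idx <- sample.int(n, size = round(n * test_frac))
  list(train = setdiff(seq_len(n), test_idx), test = sort(test_idx))
}

idx <- holdout_split(1599)

# Intended use with the objects from this report (commented out on purpose):
# fit  <- tree(quality_join ~ volatile.acidity + sulphates + alcohol + pH,
#              data = wineQualityReds2[idx$train, ])
# pred <- predict(fit, wineQualityReds2[idx$test, ], type = "class")
# mean(pred != wineQualityReds2$quality_join[idx$test])  # held-out error rate
```

A held-out error rate noticeably above 0.1538 would suggest that the tree is fitting noise in the training rows.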
The values of Citric Acid are right skewed.
Some features are quite correlated with others; e.g., fixed.acidity has a fairly strong linear relationship with citric.acid.
From the plot, it is clear that more alcohol, more sulphates and less volatile.acidity tend to make good wine.
The wineQualityReds dataset has 1599 red wines whose qualities range from 3 to 8. I started by analyzing each feature in the dataset through histograms, boxplots and summaries, and then explored the relationships between the features and the outcome through scatterplots. Eventually, I transformed the problem into a classification one, explored the quality of red wine across many variables, and created a tree model to predict quality. I found that volatile.acidity, sulphates and alcohol have important impacts on quality, whereas other features such as fixed.acidity and pH do not play important roles in terms of accuracy. In addition, features such as fixed.acidity, citric.acid and free.sulfur.dioxide are quite dependent on other features, so I did not consider them in the later discussion. For the tree model, I tried to find how the significant features decided quality, and finally kept only volatile.acidity, sulphates and alcohol as significant features. One limitation is that a tree model may be too flexible, which can cause overfitting to the training dataset; accuracy might improve with a more suitable model such as a random forest. Also, most red wines have qualities between 5 and 7 and there are only a few samples of bad and top wines, so the model might do better if the dataset were balanced.